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ABSTRACT 

We develop and demonstrate a probabilistic method for classifying rare objects in sur- 
veys with the particular goal of building very pure samples. It works by modifying the 
output probabilities from a classifier so as to accommodate our expectation (priors) 
concerning the relative frequencies of different classes of objects. We demonstrate our 
method using the Discrete Source Classifier, a supervised classifier currently based on 
Support Vector Machines, which we are developing in preparation for the Gaia data 
analysis. DSC classifies objects using their very low resolution optical spectra. We 
look in detail at the problem of quasar classification, because identification of a pure 
quasar sample is necessary to define the Gaia astrometric reference frame. By varying 
a posterior probability threshold in DSC we can trade off sample completeness and 
contamination. We show, using our simulated data, that it is possible to achieve a 
pure sample of quasars (upper limit on contamination of 1 in 40 000) with a complete- 
ness of 65% at magnitudes of G=18.5, and 50% at G=20.0, even when quasars have 
a frequency of only 1 in every 2000 objects. The star sample completeness is simul- 
taneously 99% with a contamination of 0.7%. Including parallax and proper motion 
in the classifier barely changes the results. We further show that not accounting for 
class priors in the target population leads to serious misclassifications and poor pre- 
dictions for sample completeness and contamination. We discuss how a classification 
model prior may, or may not, be influenced by the class distribution in the training 
data. Our method controls this prior and so allows a single model to be applied to any 
target population without having to tune the training data and retrain the model. 
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1 INTRODUCTION 

Finding rare objects is hard, for two reasons. First, we expect 
to have to look at many objects before we encounter one, 
and second, we may not even recognise it even when we do. 
The reason for this is that the prior probability that any 
one object is of the rare class, P(C laT e), is very small. So 
even if it has very characteristic features, i.e. the likelihood 
of the data given the rare object, P(Data | C raro ), is high, 
the posterior probability, P(C ra r C | Data), could still be low. 

A survey for rare objects obviously requires a very dis- 
criminating classifier, but we could assist it by modifying 
the class probabilities. If we raised the prior, P(C rar c), we 
are more likely to find the rare objects, but it is inevitable 
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that we will then incorrectly classify other objects as being 
of the rare class. Depending on our goals, there may be a 
satisfactory balance between maximizing the expected num- 
ber of true positives (sample completeness) and minimizing 
the expected number of false positives (sample contamina- 
tion). Here we present a method for achieving an optimal 
balance and a correct prediction of this balance which, al- 
though simple, is not trivial and has important implications. 

We illustrate our method in the context of the problem 
of detecting quasars in the Gaia survey based on their low 
resolution (R ~ 30) optical spectra. Gaia is an all-sky astro- 
metric and spectrophotometric survey complete to G = 20, 
expecting to observe some 10 9 stars, a few million galaxies 
and half a million quasars. Its primary mission is to study 
Galactic structure by measuring the 3D spatial distribu- 
tion and 2D kinematic distribution of stars throughout the 
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Galaxy and correlating these with stellar properties (abun- 
dances, ages etc.) derived from the spectra. With astromet- 
ric accuracies as good as 10 pas, Gaia cannot be externally 
calibrated with an existing catalogue. Instead it must ob- 
serve a large number of quasars over the whole sky with 
which to define its own reference frame. This quasar sample 
must be very clean (low contamination) . (The quasar sample 
is also, of course, of intrinsic interest.) 

We present our Gaia classification model, the Discrete 
Source Classifier (DSC), and report its classification perfor- 
mance using simulated data. We will show how, using our 
probabilty modification approach, we can use this to build 
pure quasar samples, at the (acceptable) loss of sample com- 
pleteness. 

Related work. One of the most comprehensive search 
for quasa rs to date is that done with the SDSS. Richards et 
al. I)2002r ) defined a colour locus in the ugriz space to identify 
objects for spectroscopic follow up and estimated the con- 
tamination rate of their photometric selection to be 34%. Us- 
ing the spectroscopy of a ll poin t sources taken in SDSS stripe 
82, Vanden Berk et al. (120051 ) assessed the completeness of 
the Richards et al. selection at 95.7 % for sources brighter 
than i = 19.1. Later, Richards et al. |2004l ) trained a photo- 
metric classifier (based on kernel density estimation) on a set 
of 22,000 spectroscopically identified quasars and used this 
to build a sample of 100 000 quasars brighter than g = 21.0. 
For the unresolved UV excess quasars in this sample, they 
estimated the completeness to be 94.7% down to <?=19.5, 
with a cont amination of ju st 5% down to </=21.0. Suchkov 
et al. (120051 ) and Ball et al. (120061 ) both use decision trees to 
classify objects in the SDSS photometric catalogue, by train- 
ing on objects with spectroscopic classifications from SDSS. 
Gao et al. (j2QQ8|) used support vector machines and Kd-tree 
nearest neighbours to classify spectroscopically-confirmed 
SDSS and 2MASS objects as quasars or stars using just the 
photometry, with stars and quasars present in equal pro- 
portions. With both methods they obtained contamination 
rates of a few percent (averaged across both classes). 

Outline. We start in section[5]by presenting our general 
prior modification method. In section[3]we present the Gaia- 
DSC classifier and the data used to train and test it. The 
performance of this and the application of our method are 
given in section [4] and discussed in section [5] where we also 
discuss the interpretation of the probabilities and some of 
the limitations of this work. We conclude in section [5] 

Definitions and notation. We use the term class frac- 
tions to mean the relative numbers of objects of each class in 
a data set. The nominal model refers to the classifier "as is" , 
i.e. the output probabilities without any modifications. In 
contrast, in the modified model we have modified these out- 
puts to be appropriate to a situation in which there would 
be very different class fractions, e.g. quasars very rare (sec- 
tion [2]4]). The subscript i refers to true classes and the sub- 
script j to model output classes. j\ ra%n , f\ esi and /, mod refer 
to the class fractions of class i for the training data, test data 
and effective test data respectively. They are normalized, 
E* ft"™ = J2i ft st = E t fr d = 1- This effective test 
data set is one which we don't actually have, but reflects a 
target population with different class fractions, e.g. quasars 
very rare. We make predictions of the performance on this by 
using the actual test data but by modifying the performance 
equations (section [23} . In the figures especially, lower case 



class names, e.g. quasar, denote estimated classes and up- 
per case names, e.g. STAR, true classes. Thus P(quasar |STAR) 
means "the DSC-assigned quasar probability given that the 
source is truly a STAR". This refers to DSC outputs for 
objects with known classes (e.g. in a test set). This is not 
to be confused with the notation for the DSC outputs in 
the general case, P(Cj\x n ,9), which means "the probabil- 
ity which DSC model 6 assigns to class Cj for object x n " . 
"C&C" stands for "completeness and contamination" . 



2 USING AND MODIFYING PROBABILITIES 
FOR CLASSIFICATION 

2.1 The issue of priors 

The probabilities and assigned classes from any classifier 
depend not only on the evidence in the data, but also on 
the prior probabilities. In this context, the prior is the class 
probabilities before you look at the data. Priors are always 
present (whether you accept the Bayesian philosophy or 
not), but they are not explicit in many models, so we may 
not know what they are. This is a problem, because if the 
model priors are inappropriate (e.g. equal probability of star 
and quasar) this will translate into poor posterior probabili- 
ties and thus class assignments. The method we present here 
allows us to replace priors (even if implicit) with something 
more appropriate, e.g. quasars much rarer than stars. 

2.2 Building samples and assessing completeness 
and contamination 

DSC assigns to every data vector (e.g. spectrum), x n , a 
probability, P(quasar), that it is a quasar, and likewise 
for the other classes. The classes are exclusive and ex- 
haustive, so the probabilities sum to unity. The classifier 
is not perfect, so generally P(quasar|QUASAR) < 1 and 
P(quasar|not QUASAR) > 0. 

We could simply assign an object to that class for which 
the DSC output is largest (the most probable class). How- 
ever, if the different output probabilities for an object were 
all similar this would not be a confident classification. A 
more secure basis for building pure samples is to select ob- 
jects only if their probability is above some threshold: the 
higher this is, the lower the sample contamination, but the 
smaller also its completeness. 

We assess the performance using a labelled test set. To 
build a sample for class class, we select objects which have 
a DSC output P(class) > P t , where P t is a user-defined 
threshold (and may be different for each class). The sample 
completeness is the number of objects in the sample which 
are truly of that class divided by the total number of objects 
of that class in the test set. The contamination is the 
number of objects in the sample which are truly of other 
classes divided by the total number of objects in the sample. 
For class = j 

completeness ■ = n '~ J,: ' 
3 Ni 

contamination j — ^~^f 3 — — (1) 

Ei 

where m,j is the number of objects of (true) class i in the 
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sample selected for (output) class j, and Ni is the total num- 
ber of objects of (true) class i in the test set. (There are var- 
ious other diagnostics one could use, such as the ROC curve 
or precision rate.) These equations give us predictions of the 
completeness and contamination for a new (unlabelled) data 
set, insofar as we believe it to have the same class fractions 
as the training data (that's our prior). 



2.3 Model-based class priors 

All classifiers include a prior, whether explicit or not. We 
need to know this prior for two reasons. First, we would like 
to know what assumption our model is actually making (and 
not what we suspect it is making) . Second, we would like to 
change this prior to something which is appropriate to the 
problem at hand. Here we discuss what the prior is, how the 
training data may influence it, and how to calculate it post 
hoc from a trained model. 



2.3.1 Bayes and priors 

The outputs from a trained classifier when presented with 



data x n are P(Cj 



), the probability that the data is of 



class Cj given the data and the model, 9. This latter quantity 
reflects both the architecture of the model and the training 
data set used to fix its internal parameters. We can think 
of this output as a posterior probability and write it using 
Bayes' theorem 



P{Cj\x n , 



p(x n \c 3 ,e)P(c 3 \e) 

P{Xn\0) 



(2) 



The term P(x n \Cj ,8) is the likelihood of the data given the 
class and model. The term P(Cj\9) is the prior probability 
that, given our model, an object is of class Cj. Bayesian 
statistics deals with updating probabilities based on new 
data: the prior reflects our knowledge (based on some other 
data) before we look at the new data. In the present con- 
text, the prior P(Ck = quasar|6>) is the probability that any 
one object in our survey is a quasar, before we look at its 
spectrum. We always have some prior information, e.g. with 
Gaia the fact that it is an all sky survey to G=20.0. If we 
know them, we could even treat the magnitude and Galactic 
latitude as prior information. 



2.3.2 What are the classifier priors and how does the 
training data influence them? 

Given that we can make the decomposition in equation [2] it 
follows that all classification models must possess a prior on 
the class probabilities. In some models, for example linear 
discriminant analysis or Gaussian mixture models, this prior 
is explict and so can be controlled. But in many others, 
such as neural networks or support vector machines, it is not 
explicit. (See a standard text on ma chine learning for details 
of these methods, e.g. Hastie et al. l|200ll V) In particular, it 
may depend on the class fractions in the training data. 

Take, for example, a standard neural network regres- 
sion model which is trained by minimizing an error function 
over the whole data set. If we trained this on 1000 stars and 
just one quasar, it will learn to recognise stars much bet- 
ter than quasars, because in minimizing the error it hardly 



has to worry about fitting the lone quasar. If we changed 
the training data (class fractions), the model and thus the 
classifications would change. Other regression models are in- 
fluenced by the class fractions in different ways, or not at all. 
Given this dependence on the model and data, we refer to 
the priors as model-based priors, and the notation P(Cj\9) 
reminds us of this. 

This issue of class fractions influencing the model per- 
formance is well-known in the machine learning literature, 
where it is referred to as the problem of "class imbalance" 
or "unbalanced data sets" . It has been demonstrated to in- 
fluence neural networks, support vector machines and clas- 
sification trees (e.g. Shin & Cho 2003, Visa & Ralescu 2005, 
Weiss 2004). But how, exactly, do the class fractions affect 
the classifications and, more specifically, the model-based 
priors? We might think that in the above example the ratio 
of the star to quasar prior probabilities implicit in the model 
is 1000 to 1, but this is generally not the cas^E because it 
depends on the model and how it is trained. The bottom 
line is that, in general, the model-based prior is not equal 
to the class fractions in the training data. 



2.3.3 Calculating the model-based priors 

We can calculate the model-based priors, P(Cj\9), directly 
from the trained model via the marginalization equation 



n=N t est 

HCj\9)= P{C ] \x n ,6)P{x n \9) 



(3) 



where the sum is taken over all Ntest objects in the test 
data set. The first term in the sum is the posterior prob- 
ability. The second term is the probability that we draw 
object x n from the test data set, which is 1 /Ntest- Hence 
the prior is simply the average of the posterior probabili- 
ties. It might seem strange that the prior can be calculated 
from the posterior. Yet because the sum is over all objects 
in the test set, regardless of their true class, we can think 
of this summation as eliminating information on individual 
objects, leaving us with what the model probability is for 
class Cj in the absence of specific data. This is the prior. 
If we had three classes equally represented in the data, and 
the classifier were perfect, the prior for each class would be 
1/3. Different class fractions and non-perfect classifiers will 
give different results^ 

In section [4] we will compare the model-based priors 
with the training data class fractions. 



1 If this were the case, then we might be tempted to address the 
class imbalance problem by changing the training data to have 
class fractions equal to our priors. But if we had just 10 000 train- 
ing vectors, we could then have only 10 quasars, making it hard 
for the classifier to correctly classify quasars. We demonstrate this 
later (section !4.6j l, 

2 Note that the prior calculation assumes the test set has the 
same class fractions as the training set. Interestingly, because the 
true classes don't appear in the equation, we don't actually need 
labels on the individual objects in order to calculate the model- 
based prior. 
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2.4 Modifying the posterior probabilities to 

account for modified class fractions or priors 

Typically we train a classification algorithm on more or less 
equal class fractions, because this helps the algorithm to 
reliably identify the class boundaries. This is our nominal 
model, Q nom , Yet the target population to which we want 
to apply our classifier may have different class fractions 
(e.g. quasars very rare), in which case the classifier may 
make poor predictions. The underlying issue here is that 
the model-based priors of our nominal model differ from our 
target priors. 

To correct for this we would ideally just replace the orig- 
inal (nominal) prior with our modified prior. But we cannot 
do this directly because the model priors are not explicit. In- 
stead we modify the output (posterior) probabilities to give 
us the "modified model", Q mod . From inspection of equa- 
tion [2] we see we may achieve this by dividing the posterior 
by the nominal prior and then multiplying it by the modi- 
fied prior. The nominal priors can be calculated using equa- 
tion[3] But we cannot use this equation to calculate the mod- 
ified priors because we don't yet have the posteriors from 
the modified model. We therefore approximate the modified 
priors using the class fractions. Let P nom (Cj\x,9 nom ) be 
the output (posterior) probability from the nominal model, 
i.e. that trained on the data set with class fractions fj ram . 
The modified probability appropriate for a target population 
with class fractions /" lo we define as 



(Cj \x n , 



: Q>n P (Cj \x n , 



rmod 
J i=j 
f train 
J i=j 



(4) 



where a n is a normalization factor which ensures that 



mod 



(Cj\x„, 



qmod\ 



1 for each object n. The factor 



lfT=T is equivalent to changing the prior in equation[2] 
and has the expected impact on the posterior. 

Following the discussion in section [!Q1 we do not neces- 
sarily expect the class fractions to be a good proxy for the 
priors, but for now it's the best we can do. (We will later 
check how good the approximation is.) 

We illustrate the above with a simple example, a two- 
class model trained on equal numbers of stars and quasars, 
i.e. j tram — (0.5,0.5). Let's suppose an object is classified 
with the nominal model to give P nom (C s tar\x n , 9 nom ) = 0.2 
and therefore P nom (C qua sar\x n , 8 nom ) = 0.8. If we want 
jmod _ (o.9,0.1) (stars nine times more common than 
quasars), then from equation 3] we get 



P mod (C at ar\x n ,e mod ) = a n 0.2 x 0.9/0.5 = a„0.36 = 0.69 
P mod (C quasa r\x n , 6 mod ) = a n 0.8 x 0.1/0.5 = a„ 0.16 = 0.31 



the final values being those after normalization. We see how 
the prior can have a significant effect on our inference. 

It is tempting to think of these modified probabilites as 
those we would have obtained if we had trained the model 
on the modified class fractions. But this is generally not 
true, because what we are doing in equation [4] is changing 
the priors (via the proxy of the class fractions) and this 
is not equivalent to changing the training data (as we will 
demonstrate later). 



2.5 Predicting performance and model-based 
priors for an effective test set using the 
nominal test set 

2.5.1 The completeness and contamination calculation 

The equations for the completeness and contamination 
(eqn. [TJ depend on the actual number of objects correctly 
and incorrectly classified and were calculated from the test 
data set, which by definition has class fractions f\ Est ■ How- 
ever, we expect to apply the modified model to a target 
population with different class fractions (this was the whole 
point of modifying it). To estimate the C&C, it is not neces- 
sary to create a new test data set. Instead, we can modify the 
equations to reflect the target class fractions, f mod } namely 



completeness. 



/■"""' 

J i=j 



N, 



N, 



(5) 



This effectively changes the number of objects in each true 
class. The quantities n it j and Ni refer to the number of ob- 
jects in samples obtained from the modified model applied 
to the actual test data set. Note that the equation for the 
completeness remains unchanged, yet the value of the com- 
pleteness will generally change because the posterior prob- 
abilities have changed and so ni=jj changes. We emphasise 
that the above equation is independent of the procedure to 
modify the probabilities described in section 12.41 It is only 
to be used to make predictions for an effective test set. To 
calculate the completeness and contamination for an actual 
test set which already has the modified class fractions, we 
use equation [T] 



2.5.2 The model-based prior calculation 

Once we have the posteriors from the modified model we 
can calculate its model-based priors from the test set using 
equation [3] But just as for the C&C calculation, we must 
change the effective number of objects in each true class to 
represent the modified class fractions. Thus in equation [3] 
we should use 



P(x n \9 mod ) 



rn 
J i 



ft 



(6) 



where x n , being an object in the (labelled) test set, has 
known true class, i. If a class is ten times rarer in the mod- 
ified population, then P(x„\8 rnod ) is reduced by a factor of 
ten compared to before. 

In principle we could now use these model-based priors 
to recalculate the posteriors for the modified model (without 
using the approximation in equation [4}, but we don't do this 
in the present article. Rather we just report them as a check 
on what the prior really is. 



3 THE DISCRETE SOURCE CLASSIFIER 

The Discrete Source Classifier (DSC) is the algorithm we 
are developing as part of our work in the Gaia Data Pro- 
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cessing and Analysis Consortium (DPAC) (O'Mullane et al. 

DSC will classify objects based primarily on low 
resolution prism spectra (so-called "BP/RP" spectra), but 
supplemented with parallaxes and proper motions and pho- 
tometric variability where available. Here we restrict our- 
selves to the three classes "star", "galaxy" and "quasar". 
After DSC in the pipeline come several algorithms for es- 
timating astrophysical parameters, e.g. stellar parameters. 
For m ore details of the classification system, see Bailer- Jones 
|2005l ). 

Gaia detects objects in a very broad band, G, which 
has roughly the same wavelength range as the BP/RP in- 
strument. In common with the Gaia detection strategy, DSC 
only operates on point sources (i.e. makes no use of mor- 
phological information) . In this paper we present results for 
simulations at both G=20.0 and G=18.5. 

Here we describe the underlying classifier, the source 
libraries and the Gaia simulated data used to train and test 
DSC. 



3.1 The SVM classification algorithm 

DSC is presently based on a S upport V ector Machine (SVM) 
(e.g. Cortes & Vapnik (|l995l ). Burg es (1 19981 ) 1. using the lib- 
SVM implmentation (Chang & Lin (|200lh ). An SVM is a su- 
pervised learning algorithm which finds the boundary which 
"optimally" separates two classes of objects. In contrast to 
many other classifiers, SVMs are defined only by those train- 
ing objects which lie close to the boundary, the so-called 
support vectors. The basic algorithm is linear, but by using 
the "kernel trick" to implicitly map data into a higher di- 
mensional space, it achieves a nonlinear mapping between 
the input data (the spectrum) and the output classes. The 
SVM is trained by minimizing the number of misclassifica- 
tions and using a regularizer for complexity control. 

SVMs are fundamentally non-probabilistic: they simply 
define a boundary and assign a class (Ci or C2) depending 
on which side of the boundary an object falls. Probabilities 
may be derived from the distance, /, of an object from the 
boundary, in the present case by fitting a sigmoidal function 
to P(Cj = Ci\f) (Piatt |200d )1. 

The basic SVM is a two-class classifier. libSVM actually 
trains K(K — l)/2 one-against-one classifiers (for a K class 
problem) an d use s the pairwise coupling method described 
in Wu et al. (|2004T ) (their algorithm 2) to produce normalized 
probabilities for K classes. 

One of the advantages of SVMs over many other nonlin- 
ear classifiers (e.g. neural networks) is that the SVM objec- 
tive function is convex, so the training has a unique solution. 
However, the model has two hyperparameters, the regular- 
ization parameter (the "cost") and the scale length in the 
radial basis function (RBF) kernel we use. We optimize these 
( "tune" ) using a Nelder-Mead algorithm. As this involves re- 
training the SVM for each combination of hyperparameters, 
it is a time-consuming process. 

The main concepts presented in this paper are not de- 
pendent on the choice of classifier. SVM is not perfect, not 
least because its underlying philosophy of not modelling the 
data distribution means that probabilties do not arise natu- 
rally. It is, nonetheless, a competitive model in terms of clas- 
sification error, and in tests we have done it has performed as 
well as or better than several alternatives, including multi- 
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Figure 1. The distribution of stellar parameters in the training 
and testing data sets 



layer perceptrons, RBF network s, boo sted trees and mixture 
models (Elting & Bailer- Jones (|2007l 'l). 



3.2 Source spectral libraries 

Libraries of spectra are being assembled or created for Gaia 
purposes. Our s tellar libraries are based on the Base l (Leje - 
une et al. i|l997t )') and MARCS (e.g. Gustafsson et al. (|2008h ) 
libraries plus an OB star library from J.-C. Bouret. To- 
gether these span the full range of effective temperature, 
metallicity and surface gravit y. Th e galaxy library is de- 
scribe d by T salmantza et al. (20Q8|) (see also Tsalamantza 
et al. (2007)). It is based on synthetic spectra derived from 
galaxy evolution models with several astrophysical param- 
eters, in particular those which describe the star formation 
history and the redshift. The quasar library is also synthetic, 
with each spectrum uniquely defined by the three param- 
eters continuum slope, a, emission li ne eq uivalent width, 
EW, and redshift, 2 (Claeskens et al. (|2006l )). Their values 
are chosen at random from a distribution which is flat in 
redshift over the range 0-5.5, flat in a from —4 to +3 and 
exponentially decreasing in EW from to 100 000 A. We 
further simulated the galaxy and stellar spectra at random 
values of interstellar extinction over the range Ay = 0-10 
(with fixed extinction law Rv = 3.1). The main stellar pa- 
rameters which influence the potential contamination with 
the quasars are T e g and Ay, their distribution is plotted in 
Fig. [T] The distributions may be a little artificial (the three 
"steps" reflect the three libraries), but they are sufficient for 
the purpose of this article. 

This data set is obviously rather limited (e.g. purely 
synthetic data, no extinction in the quasars, no emission 
line stars, no strong galaxy emission) . Our main goal at this 
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Figure 2. Random selection of five stars (blue solid lines) and 
five quasars (red dashed lines) (G=18.5, noise included). The first 
46 pixels are the blue channel (Blue Prism, BP), the remaining 
40 the red channel (Red Prism, RP), separated by a dashed line. 

stage is not to produce a data set tuned to the real Gaia 
sky. We discuss the issue of training sets more in section [S] 

3.3 Gaia simulated spectra 

To build and test the processing pipeline, the DPA C has 
built a sophisticated data simulator (Luri et al. (2005)). For 
each object in the spectral libraries we simulate its BP/RP 
spectrum, parallax and proper motion, including all sources 
of random error. Gaia observes every part of the sky be- 
tween 40 and 200 times over five years (depending on eclip- 
tic latitude). The average number of spectra for an object 
in an end-of-mission stack is 72, which we use here. Our 
simula ted da ta are the so-called cycle 3 data (Sordo & Val- 
lenari (|2008l |). 

Gaia prism spectra are obtained in two channels, one 
for the blue called BP and one for the red called RP, with 
cut-offs defined by the (silver) mirror response in the blue, 
CCD QE in the red, and bandpass niters in the middle. Af- 
ter removal of low sensitivity regions, BP ranges from 338 
to 998 nm over 46 pixels and RP from 646 to 1082 nm over 
40 pixels. We use all 86 pixels as separate inputs to the clas- 
sifier (we do not attempt to combine the wavelength over- 
lapped region). The dispersion varies strongly with wave- 
length from 3 nm/pixel (blue end) to 30 nm/pixel (red 
end). The instrumental point spread function (PSF) is sig- 
nificantly broader than the pixel size, and in one conserva- 
tive estimate there are a re only 18 independent measures 
in BP/RP (Brown (2006])). Example spectra are shown in 
Fig. [2] The dominant noise source is Poisson noise from the 
source, so the SNR is approximately the square root of the 
flux. (At G=20, the fluxes are four times smaller and the 
SNR is about half.) The corresponding library spectra (i.e. 
prior to being put through the Gaia instrument simulator) 
are shown in Fig. [3] The original stellar library spectra are 
calculated at a resolution of 0.1 nm and show many absorp- 
tion lines, so to avoid confusion, we just plot every 20 th pixel. 
Note how much information is lost - especially concerning 
the spectral lines - in BP/RP. 



o 




300 400 500 600 700 800 900 1000 
wavelength / nm 



o 




300 400 500 600 700 800 900 1000 
wavelength / nm 



Figure 3. The (noise-free) library spectra for the stars (top 
panel) and quasars (bottom panel) shown in Fig. [2] The pho- 
ton flux has been normalized in each case to have a maximum of 
1.0. 

3.4 Simulated astrometry 

We included the astrometry in our classifiers in the expecta- 
tion that it could help distingiush galactic and extragalactic 
objects. Fig. [4] shows the distribution of the simulated as- 
trometry (with noise) for G=18.5. Astrometry for the extra- 
galactic objects is just zero plus noise. Note that this causes 
negative parallaxes (which are preserved in the Gaia astro- 
metric data processing for model fit assessment purposes). 
Stellar parallaxes, vj, are simulated in a way to ensure con- 
sistency with the apparent and absolute magniutudes and 
stellar parameters. The stellar proper motions are drawn at 
random from a zero mean Gaussian with standard devia- 
tion lOzu/yr, which simiulates a distance-independent space 
velocity distribution with a standard deviation of 50km/s. 

At G=18.5, the Gaia parallax error is typically 0.1 mas 
and the proper motion error around 0.14 mas/yr (both dom- 
inated by source photon statistics). The errors are twice as 
large at G=20.0. 

3.5 Train and test data 

The input data for our classifiers is BP/RP plus astrometry 
(although it turns out that inclusion of astrometry hardly 
improves the results) . Models are trained and tested on noisy 
data and a single G-band magnitude is used in each experi- 
ment. 

Train and test data are drawn at random (without re- 
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Figure 4. Distribution of the simulated astrometry. 
Bluc/solid=star, red/dashed=quasar, black/dotted=galaxy 
(which is indistinguishable from the quasars). 



placement) from a common data set. Unless stated other- 
wise, the training data set comprises 5000 objects from each 
of the three classes and the test data 60 000 objects of each 
class. The reason for the large number in the test set will 
become clear in section I4.3I 



Table 1. Confusion matrix for class assignments from maximum 
probability. Each row corresponds to a true class and sums to 100%. 
Nominal priors, G=18.5 





galaxy 


quasar 


star 


GALAXY 


99.36 


0.19 


0.45 


QUASAR 


1.03 


96.04 


2.94 


STAR 


0.49 


0.88 


98.63 
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Figure 5. Histograms of DSC outputs for each class showing 
how confident DSC is of identifying each class. Nominal priors, 
G=18.5 



4 RESULTS 

4.1 Overview of the experiments and explanation 
of the figures 

We present results for three experiments 

(1) objects at G=18.5, 

(2) as (1), but in which quasars with emission line equiv- 
alent widths less than 5000 A have been removed from the 
training data set, leaving 2901 quasars (the test set is un- 
changed) , 

(3) as (2), but for G=20.0. 

We show results for both the "nominal" and "modified" 
cases. In the nominal cases, the class fractions are those 
given by the data sets themselves, namely f traln = f test = 
(1, 1, 1) for (star, quasar, galaxy) respectively, except for the 
training data in cases 2 and 3 which has f traln = (1, 0.58, 1). 
(We list here the class fractions prior to normalization.) In 
the modified cases, the DSC output probabilities are mod- 
ified (as described in section I2.4|l to accommodate differ- 
ent expected class fractions (priors) . Quasars are made 1000 
times rarer, so f mod = (1, 1/1000, 1). This is the order of 
the star/quasar ratio we expect for Gaia at G=20. (In com- 
parison, a larger ratio is found in SDSS - as high as 0.015 at 



g — 18.5 depending on Galactic latitude - although this es- 
timate probabl y suffer s from stellar incompleteness (Bailer- 
Jones & Smith (|2008al )U Galaxies will also be rare, so in re- 
ality we should also modify their class fraction. Yet we inten- 
tionally modify the class fractions just for a single class here 
in order to better illustrate our prior modification method. 

Results are presented in three ways. The first is a con- 
fusion matrix, in which an object is assigned to the class 
with the largest DSC output (most probable class) and the 
table shows the correct classifications (on-diagonal values) 
and misclassification values (off-diagonal values) as percent- 
ages. Second, we show histograms of the posterior probabili- 
ties. All histograms are density estimates with unit area and 
always show the actual test data sets (no adjustment in the 
number of plotted objects for the modified cases). We then 
set thresholds, Pt, on the posterior probabilities in order to 
build samples (section I2.2J1 . The third method of illustra- 
tion shows how the sample completeness and contamination 
varies with this threshold for each output class. 

4.2 G=18.5 with all quasars in the training data 

We report only briefly on this case, primarily in order to 
motivate the removal of the low EW quasars in experiments 
(2) and (3). 



8 C.A.L. Bailer- J ones et al. 



star 



quasar 



£= 00 

■9 o 
o 

4— 

9- o 



o 
o 




c oo 

■B d 

o 

4— 

& d 

(0 
(0 



q 
d 




0.0 



0.4 
PJ 

galaxy 



0.8 



0.0 



0.4 
P t 



0.8 



No. in QSO sample 



13 

4— 

S- o 



o 
d 




Q. v 

E 

SS oo 



o 
c 

o 




0.0 



0.4 
P t 



0.8 



0.0 



0.4 
P t 



0.8 



Figure 6. Completeness (blue/thick line) and contamination 
(red/thin line) of a sample as a function of the probability 
threshold. The bottom right panel shows the (logarithm) of the 
actual number of different types of class in the quasar sam- 
ple: stars (blue/solid line); quasars (red/dashed lines); galaxies 
(black/dotted line). Nominal priors, G=18.5 
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Figure 7. Completeness (blue/thick line) and contamination 
(red/thin line) of a sample as a function of the probability 
threshold. The bottom right panel shows the (logarithm) of the 
actual number (thick lines) of different types of class in the 
quasar sample (i.e. in the test set): stars (blue/solid line); quasars 
(red/dashed lines). The thin lines show the corresponding effec- 
tive number of objects for the modified case. Modified priors, 
G=18.5 
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Figure 8. Examples of low EW quasars (red/dashed lines) and 
those stars (blue/solid lines) which are the contaminants in the 
quasar sample in the modified G=18.5 model 



The confusion matrix (Tabled} shows that good classi- 
fication accuracy is achieved with the nominal model. The 
histograms of the posterior probabilities (Fig. [SJ) further 
show that these classifications are achieved with high confi- 
dence. This translates into a satisfactory trade-off in sample 
completeness and contamination at thresholds between 0.5 
and 0.9 (Fig. 01.0 The bottom right-hand panel shows that 
stars rather than galaxies are the main contaminant in the 
quasar sample. If we now proceed to the modified model, 
the C&C curves are very different (Fig. [7}. While we would 
expect higher contamination for a given completeness level 
for quasars (because they are now very rare), we see that 
it is impossible to get a low contamination quasar sample 
at all. The bottom right-hand panel of that plot explains 
why: The effective number of stars in the sample (366) is 
much higher than the effective number of quasars (63), so 
the contamination is high, 366/(366 + 63) = 0.85 (To recall: 
the effective number is the number we would have in a data 
set which had the modified class fractions.) 

Inspection of the contaminants shows that they are pre- 
dominantly cool, highly-reddened stars, with T e e between 
4000 and 8000 K and Av mostly between 8 and 10 mag. 
While the full stellar data set is dominated by stars in 
this temperature range, the extinction distribution peaks to- 
wards lower extinctions (Fig. [TJ, so this tendency for highly 
extincted contaminants is significant. Figs. [8] and [9] plot a 
random selection of these stellar contaminants along with a 
random selection of quasars with lower values of the emission 
line EW parameter. Note how the very broad PSF of BP/RP 
is smoothing out the sharp features in the quasar spectra, 
especially at the red of the two channels where the dispersion 
is lower. The narrower quasar emission lines are not resolved, 
and these objects look more like the smoother stellar spec- 
tra. We hypothesize that if we remove such quasars from 



3 The completeness always decreases monotonically from 1 at 
Pt = (all objects included in sample) to to Pt = 1 (no objects 
in sample). The maximum contamination occurs at Pt = with 
a value depending on the class fractions (for equal class fractions 
in K classes it will be 1/K) and drops to zero at Pt = 1. 
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Figure 9. The (noise-free) input spectra for the stars (top panel) 
and quasars (bottom panel) shown in Fig. [8] The photon flux has 
been normalized in each case to have a maximum of 1.0. 



Table 2. Model-based priors for the nominal, P(Cj \8 nom ), and mod- 
ified, P(Cj\9 mod ), cases for the full training data and the case in 
which low EW quasars have been removed ("nlEW"). For compari- 
son we show the class fractions relevant to he nominal models, /' r£ " n , 
and the modified models f mod 







data 


G 


star 


quasar 


galaxy 


p(Cj\e nom ) 


full 


18.5 


0.3380 


0.3279 


0.3341 


f train 




full 


18.5 


0.3333 


0.3333 


0.3333 


P{Cj\t 




full 


18.5 


0.4965 


0.002514 


0.5010 


fmod 
J i 




full 


18.5 


0.4998 


0.000500 


0.4998 


P(Cj\l 




nlEW 


18.5 


0.367 


0.283 


0.350 


P(Cj\t 




nlEW 


20.0 


0.368 


0.260 


0.372 


jtrain 




nlEW 


both 


0.388 


0.225 


0.388 


P(Cj\t 




nlEW 


18.5 


0.4983 


0.000328 


0.5013 


P(Cj\t 




nlEW 


20.0 


0.4762 


0.000277 


0.5234 


fmod 




nlEW 


both 


0.4998 


0.000500 


0.4998 



our training set, and thus from our definition of quasars, 
then the SVM should not so readily confuse these stars with 
our quasar class. We test this in the next experiment (sec- 
tion fOJ. 

Table [2] lists the model-based priors (section I2.3f) . The 
first line is for the nominal model. The second row gives, 
for comparison, the fraction of objects in each true class, i, 
in the training data. These we may consider as frequentist 
estimates of the model priors, insofar as the frequency dis- 



Table 3. Confusion matrix for class assignments from maximum 
probability. Each row corresponds to a true class and sums to 100%. 
Nominal priors, G=18.5, no low EW quasars in training data 





galaxy 


quasar 


star 


GALAXY 


99.37 


0.00 


0.63 


QUASAR 


4.22 


85.59 


10.19 


STAR 


0.68 


0.13 


99.19 



tribution of the classes dicates these. At least for this SVM 
model with equal class fractions, the model-based priors are 
close to the class fractions. 

The third and fourth lines give the same but for the 
modified model. Now we see that the modified class frac- 
tion, fi %od , for the quasars is not a good proxy for the 
model-based prior. This implies that its use in equation 2] 
will give poor estimates for the true posterior probabilities. 
We could attempt to improve this by an iterative procedure: 
Now that we have the model-based priors, we can recalcu- 
late the model posteriors directly from Bayes' equation ((2| 
- rather than our approximation (equation Q - and then re- 
calculate the model-based priors with equation [3] However, 
we don't do this because in our main experiments (next), 
the discrepancy is not as large. 

4.3 G=18.5 with low EW quasars removed from 
the training data 

Motivated by the results of the previous experiment, we 
removed the low equivalent width quasars (EW< 5000 A) 
from the training data set (2099 of 5000) and re-tuned and 
re-trained the SVM. (The choice of 5000 A is somewhat ar- 
bitrary.) The test set is unchanged. 

4-3.1 The nominal model 

Comparing the confusion matrix (Table [3} to that in the 
previous experiment, we now see that fewer quasars are cor- 
rectly classified, with 10% being misclassified as stars. Yet 
this loss of quasars is balanced by the fact that six times 
fewer stars are now misclassified as quasars (0.13% rather 
than 0.88% previously, or 78 stars rather than 528). This is 
what we wanted to achieve by modifying the training sam- 
ple. 

Note how few galaxies are misclassified as stars and 
quasars, and how this has hardly changed from the pre- 
vious experiment. In all experiements we have performed 
with these data (including many not reported here), the 
galaxies are always classified with high confidence. As we are 
not changing the class fractions for galaxies - they are act- 
ing mostly to make the classification problem for the SVM 
harder - we focus on the stars and quasars from now on. 

We summarze the model confidence (posterior probabil- 
ties) in the histograms in Fig. 1101 We can read several things 
from this: the leading diagonal shows P(class|CLASS), how 
confident the true positives are; the central row shows 
P(class|QUASAR), the probabilities assigned to each class for 
true quasars; the central column shows P(quasar |CLASS), 
the quasar probabilities assigned to objects of each true 
class. We see that the confidences for the correct classes 
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Figure 10. Histograms of the DSC class posterior probabilities, shown for each output class (columns) split for objects of each true 
class (rows). Nominal priors, G=18.5, no low EW quasars in training data 



are still high. However, comparing with the results of the 
previous experiment (Fig. we now see a set of true 
quasars which are now assigned very low probabilities, 
P(quasar | QUASAR). These are predominantly (but not ex- 
clusively) the low EW quasars that were removed from the 
training data. These low probability cases are reflected as 
a dip in the quasar completeness curve (Fig. Hip at low P t . 
These quasars show up in the star sample and we corre- 
spondingly see a higher contamination for the stars. Yet the 
positive trade off from this is a lower quasar sample contam- 
ination. To give numbers: At P t = 0.8 the completeness and 
contamination fractions for stars are 0.976 and 0.071 respec- 
tively and for quasars are 0.8020 and 0.0005. So we see that 
even with the nominal model we can get a very pure quasar 
sample, but at the cost of a rather unpure stellar one. 

4-3.2 The modified model 

We turn now to the modified model in which our prior is that 
quasars are 1000 times rarer. The histograms of the posteri- 
ors are show in Fig. [I2]which we may compared to the nom- 
inal case in Fig. 1101 The main difference can be seen in the 
central plot, where we now observe a significant peak around 
zero probability for P(quasar |QUASAR): the modified model 



has a high barrier to classifying anything as a quasar. This is 
the result of the probability remapping, equation [4] (whereby 
the impact of the normalization should not be ignored) . This 
provides more discrimination between those quasars which, 
in the nominal case, were all assiged an output probability 
of almost 1.0. 

Fig. H3l shows the C&C for the modified model. It is sig- 
nificantly different from the nominal case. The quasar com- 
pleteness is slightly reduced, but we win a much lower con- 
tamination. Indeed, at Pt — 0.11 the contamination drops 
to zero for a sample completeness of 65%. This translates to 
a contamination of less than 1 in 0.65 x 60000 = 39 000. Not 
only is this better than for the nominal model (where the 
contamination only drops to zero at Pt = 1.0), but now the 
star and galaxy contaminations are also much lower. 

The bottom right-hand plot of Fig. [13] shows the actual 
(solid line) and effective (thin line) number of objects in the 
quasar sample. At P t = 0.8 these are 36149 (= 10 4 ' 56 ) and 
54 (= 10 1 ' 73 ) respectively. If we had done this experiment 
with a test set containing, say, only 2000 quasars, then the 
effective number of quasars in the resultant sample would 
only have been 54 x (2000/60 000) = 1.8. This is not enough 
to draw results of any significance and explains why we need 
large test sets for evaluating the modified model. 
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Figure 12. Histograms of the DSC class posterior probabilities, shown for each output class (columns) split for objects of each true 
class (rows). Modified priors, G=18.5, no low EW quasars in training data 



Table 4. Confusion matrix for class assignments achieved by apply- 
ing a probability threshold. Each row gives the percentage of objects 
of each true class assigned to the different output classes (columns). 
As an object may now be assigned to more than one class, the val- 
ues in a row no longer sum to 100%, plus some objects may remain 
unclassified. The thresholds applied are Pt = 0.2 for quasars and 
Pt = 0.8 for stars and galaxies. Modified priors, G=18.5 





galaxy 


quasar 


star 


unclassified 


Effective 












fraction 


GALAXY 


98.97 


0.00 


0.64 


0.73 


1.0 


QUASAR 


6.82 


62.00 


26.37 


8.73 


0.001 


STAR 


0.78 


0.00 


98.69 


1.09 


1.0 



The confusion matrix for this modified case is shown in 
Table |4] As class assignments are now based on thresholds 
rather than maximum probability, and because we can set 
these thresholds independently of one another, a given ob- 
ject may be assigned to more than one class or it may remain 
unclassified. Therefore the values in a row no longer sum 
to 100%. Here we use thresholds of 0.8,0.2,0.8 for galaxy, 
quasar and star respectively. The quasar threshold is chosen 
to give zero quasar contamination and delivers a good com- 



pleteness of 62%. The star and galaxy threholds were also 
chosen from inspecting the C&C curves in Fig. 1131 and yield 
high completeness (around 99%) and about 0.7% cross con- 
tamination. From a first glance at the table we might think 
that this star sample is heavily contaminated by quasars 
(26.37%). Yet we must remember that quasars are effec- 
tively rare, so the fraction of quasar contaminants expressed 
as percentage of objects in the star output is 



100% x 



26.37 x 0.001 

26.37 x 0.001 + 98.69 x 1.0 + 0.64 x 1.0 



= 0.03% 



(The quasar contamination of the galaxy sample is even 
smaller) . Recall also that the histograms, such as Fig. 1121 
show the actual number of objects in the test set, not the 
effective number. So what looks like a relatively large num- 
ber of quasars confidently classified as stars corresponds to 
a far smaller number in the target population. 

It is again interesting to identify the misclassifications. 
Those true stars which are most confused as quasars are 
show in Fig. 1141 As can be seen from the plot of the cor- 
responding library spectra (Fig. 1 1 5 P - four of these are hot 
stars with T e e ~40 000K. The other three are cooler (two 
around 4000 K, one around 7000 K) but have very high ex- 
tinctions (8-9 magnitudes Ay). This gives rise to the much 
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Figure 11. Completeness (blue/thick line) and contamination 
(red/thin line) of a sample as a function of the probability 
threshold. The bottom right panel shows the (logarithm) of the 
actual number of different types of class in the quasar sam- 
ple: stars (blue/solid line); quasars (red/dashed lines); galaxies 
(black/dotted line). Nominal priors, G=18.5, no low EW quasars 
in training data 

larger flux in the red which DSC confuses with a quasar-like 
broad emission line feature. 

We recall that in this experiment we removed the low 
EW quasars from the training set, but not the test set. 
We may, therefore, expect this model to perform particu- 
larly badly on these objects. However, if we plot Fig. [12] 
just for these objects it turns out to be broadly sim- 
ilar, although with about half as many objects in the 
P(quasar|lowEWQUASAR) = 1 bin and twice as many in 
the P(quasar|lowEWQUASAR) = bin. So while this model 
doesn't do as well on the objects omitted from the training 
data, it is still able to assign many a high probability. That 
is, DSC has been able to extend its recognition of quasars 
to lower EW cases than it was trained on. 

4-3.3 Testing the predictions of the modified model 

The C&C plots for the modified model are of course just pre- 
dictions of what the C&C would be if we applied the mod- 
ified model to a real data set which has the modified pop- 
ulation fractions. We can test these predictions by building 
a new test data set in which the population fractions really 
are (1,1/1000,1). To do this we took the original test data 
set but retained just a random selection of 1/1000 of the 
quasars (i.e. 60 quasars). We applied the modified model 
and calculated the C&C (now using equation [1] of course, 
not equation[5]). As there are only 60 quasars in this set, the 
predictions have high variance, so we bootstrapped the se- 
lection 20 times. Fig. [16] shows that the measured C&C are 
very close to the predictions, thus validating our approach. 



Figure 13. Completeness (blue/thick line) and contamination 
(red/thin line) of a sample as a function of the probability 
threshold. The bottom right panel shows the (logarithm) of the 
actual number (thick lines) of different types of class in the 
quasar sample (i.e. in the test set): stars (blue/solid line); quasars 
(red/dashed lines). The thin lines show the corresponding effec- 
tive number of objects for the modified case. There are no galaxy 
contaminants. Modified priors, G=18.5, no low EW quasars in 
training data 

4-3-4 The nominal model yields high contamination when 
quasars are rare 

Theoretical arguments aside, could we nonetheless use the 
nominal model to classify data sets in which quasars are 
rare? To assess this, we applied the nominal model to the 
few-quasar data set from the above test (section I4.3.3p . 
The measured C&C are plotted in Fig. [T7] as the dashed 
lines. These agree well with the C&C we would predict on 
the effective test data set (i.e. using equation [S} but using 
the nominal model, which are shown as the solid lines in 
Fig. 1171 While they agree, this figure shows that the sample 
contamination is far higher when using the nominal model 
than the modified model (Cf. Fig. [16}. That is, the modified 
model is superior at obtaining pure quasar samples. This 
demonstrates that we must take into account the expected 
class fractions (i.e. use a suitable prior) when applying 
a classifier to a data set in which we expect a class to be rare. 

Finally for this experiment, we note that the model-based 
priors agree quite well with the training data class fraction 
or the modified class fraction for the nominal and modified 
models respectively (Table [2}. 

4.4 G=20.0 with low EW quasars removed from 
the training data 

At G=20 the noise is higher, so we expect DSC to perform 
less well. The basic class assignment based on maximum 
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Figure 14. Spectra of stars which are most confused as 
quasars, namely those 7 stars which have the highest values of 
P mod (quasar | STAR). G=18.5, modified priors, no low EW quasars 
in training data. 




Figure 17. As Fig. 1161 but now for the nominal model 



Table 5. Confusion matrix for class assignments from maximum 
probability. Each row corresponds to a true class and sums to 100%. 
Nominal priors, G=20.0, no low EW quasars in training data 



co 
o 



CD 

d 



d 



eg 
d 



o 
d 




1 1 1 1 1 1 1 1 

300 400 500 600 700 800 900 1000 
wavelength / nm 

Figure 15. The (noise-free) input spectra for the stars shown in 
Fig. 1141 The photon flux has been normalized in each case to have 
a maximum of 1.0. 
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Figure 16. Comparison of the completeness (blue/thick lines) 
and contamination (red/thin lines) obtained with the modified 
model at G=18.5 (no low EW quasars in training data). The solid 
lines are the predictions (same as those in the the top right panel 
of Fig. 1131 1 and the dashed lines are the measurements from a data 
set in which quasars really are rare (class fractions of (1,1/1000,1)) 
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posterior class probability is indeed worse (Tabled}. Com- 
paring with Table E] we see that the stars suffer in particu- 
lar, with only 86% being correctly classified, as opposed to 
99% at G=18.5. This is reflected by the lower confidence the 
model gives to its classifications (Fig. [18}. Likewise, the com- 
pleteness is lower and the contamination higher at a given 
threshold (Fig. EE} . 
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Figure 18. Histograms of DSC outputs for each class showing 
how confident DSC is of identifying each class. Nominal priors, 
G=20.0, no low EW quasars in training data 
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Figure 19. Completeness (blue/thick line) and contamination 
(red/thin line) of a sample as a function of the probability 
threshold. The bottom right panel shows the (logarithm) of the 
actual number (thick lines) of different types of class in the 
quasar sample (i.e. in the test set): stars (blue/solid line); quasars 
(red/dashed lines). Nominal priors, G=20.0, no low EW quasars 
in training data 

If we now modify the priors and recalculate the poste- 
rior probabilities then we see a similar shift in the distri- 
butions that we did for the G=18.5 case (histograms not 
shown). The C&C plot with the modified model is shown 
in Fig. 1201 What stands out immediately is that we are still 
able to get a clean (zero contamination) quasar sample. This 
occurs once Pt = 0.07, at which point the completeness 
is 58% (and remains at above 50% if we choose to take a 
larger threshold). This compares to a completeness of 65% 
at Pt = 0.11 for zero contamination at G = 18.5. Thus we 
see that the higher noise in the data can be translated into 
a lower completeness (which is acceptable) rather than a 
raised contamination (which is not). Notice that there is no 
guarantee that the contamination should drop to zero before 
Pt — 1 (cf. section l4~2)l . so this is an important result. 

4.5 The impact of using astrometry in 
classification 

We may expect parallaxes and proper motions to provide 
a good discriminant between Galactic and extragalactic ob- 
jects. Yet it turns out that if we remove astrometry from the 
DSC, the performance hardly degrades (accuracy decreases 
by no more than 1%). This observation may not be surpris- 
ing, because the majority of objects with astrometry consis- 
tent with zero (within the measurement errors) are actually 
stars, and that is because our training sample is dominated 
by distant stars. It is also not specific to the SVM. We also 
built a two-stage classifier, in which (1) an SVM predicts 
classes based only on photometry and (2) Gaussian mixtures 



Figure 20. Completeness (blue/thick line) and contamination 
(red/thin line) of a sample as a function of the probability 
threshold. The bottom right panel shows the (logarithm) of the 
actual number (thick lines) of different types of class in the 
quasar sample (i.e. in the test set): stars (blue/solid line); quasars 
(red/dashed lines). The thin lines show the corresponding effec- 
tive number of objects for the modified case. There are no galaxy 
contaminants. Modified priors, G=20.0, no low EW quasars in 
training data 

are used to model the density in the 2D astrometric space 
and predict classes. The probabilities are the n comb ined to 
give a single posterior (Bailer- Jones & Smith (2008b)). The 
results are no better than an SVM using BP/RP and astrom- 
etry (although the two-stage classifier offers more flexibility 
and interpretability). 

4.6 The effect of reducing the fraction of quasars 
in the training data 

In section 12.3.21 we argued that the model class priors are 
not necessarily dictated by the class fractions in the train- 
ing data. Moreover, in attempting to match the training 
data distribution to that in the target population for a rare 
object, we may end up with very few objects in the training 
data, such that the classifier does not learn to classify them 
well. 

We test this by varying the fraction of quasars in 
the training data set, q, from 100% (equal class fractions) 
down to 0.1% (60 quasars). We then build samples based 
on maximum probability and measure the C&C (again us- 
ing equation [5] to account for the modified class fractions). 
Fig. [21] shows how the completeness and contamination of 
the quasar sample varies with q. For modest reductions in q 
the contamination is 5%, which is too high for a pure sample 
(although the samples are built using maximum probability, 
so we could reduce this at the cost of reduced completeness) . 
Surprisingly, the contamination actually drops to zero once 
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Figure 21. Quasar sample completeness (thick blue line) and 
contamination from stars (thin red line) and galaxies (dotted 
black line) as a function of varying the fraction of quasars in 
the training data set (the first two values are 0.1% and 1%) 

the quasar fraction drops below 5%. However, this comes 
at the price of greatly reduced completeness of just 0.1% at 
q =0.1%. This we can directly compare to the results from 
the modified model in experiment (2) (section |4. 3. 2[) , where 
the completeness is still 62%. Thus we conclude that prun- 
ing the training data set is not the way to achieve a good 
classifier for a rare class. 



5 DISCUSSION 

5.1 Training data distributions and class 
definitions 

Any supervised method is limited by the data it is trained 
on. Systematic differences between these and the target pop- 
ulation can compromise the classifier. It is not the goal of 
this article to present the final classifier for Gaia, so we have 
not yet optmized the training sets. Nonetheless, our work 
does raise some issues relevant to this important next step, 
which we briefly discuss. 

First, we found that removing the low EW quasars was 
important for achieving clean quasar samples in the mod- 
ified case. This may reflect a limitation of the SVMs, but 
it raises the general issue of how to set the training data 
distribution. We do not expect a supervised method to per- 
form well on regions of the data space not included in the 
training data (although we did find that the model could 
correctly classify many low EW quasars). For this reason, 
DSC will be preceeded by an "outlier detector" in the clas- 
sification pipeline. This assesses how well an observed object 
is "covered" by the training space and only passes "known" 
objects to DSC. It could be based on density estimation (in 
low dimensions) or one-class SVMs. 

Second, if we are building a classifier only to look for a 
specific type of object, then we don't need to aim for com- 
pleteness in the training sample (although we do need to 
include potential contaminants). For example, none of our 
quasar simulations include interstellar extinction (unlike the 
stars and galaxies) . This could be motivated by saying we do 
not trust extinction laws and just want to find unreddened 
quasars. We are free to make this choice at the cost of re- 
duced completeness of real quasars. If we included reddened 



quasars in the test set, then either they would be classified 
as quasars anyway (in which case the completeness goes up), 
or they would be classified as something else. Either way, the 
quasar contamination would not increase so our conclusion 
about building pure samples of unreddened quasars is unaf- 
fected. Incidentally, the results in section|3]are very similar if 
we repeat the experiments but now with interstellar extinc- 
tion applied also to the quasars. The main difference is that 
the quasar sample completeness is reduced by around 10%, 
with the more reddened quasars now contaminating the star 
samples. The quasar contamination is not increased. 

Third, how should we define the classes? If we had split 
the quasars into two classes (e.g. high and low emission line 
EW) maybe this would help the SVM to retain zero con- 
tamination on the high EW class without having to remove 
the low EW quasars entirely from the training data. This 
will be the subject of future work. 

Gaia does not exist in a void; we of course have in- 
formation from other catalogues/surveys to help us identify 
quasars. The full DSC is actually a multi-stage algorithm, 
in which the present paper just describes the stage based on 
the BP/RP spectrum. Other stages provide class probabili- 
ties based on other data, for example the source magnitude 
or Galactic latitute or external catalogues, in which case we 
might call them "priors". We will describe this multi-stage 
approach in a future paper. 

5.2 How do we interpret posterior probabilities? 

When we build a sample by applying a threshold, all of the 
objects have a posterior probability which is at least as high 
as the threshold. For example, with the modified model in 
section [4.3.2L we were able to set Pt = 0.11 and produce a 
sample of only quasars. There is of course no guarantee that 
all samples built using this threshold have no contaminants. 
On the other hand, this threshold is not a statement about 
the sample as a whole: it would be wrong to expect only 11% 
of the sample to be quasars. We need to look at the actual 
proabilities, not a lower limit. How can we use these? Imag- 
ine a case in which we set Pt = 0.90 and thereby obtained 
a sample of 1000 (supposed) quasars, all which happen to 
have P(quasar) = 0.95. We would reasonably expect 5% 
not to be quasars. Yet we must be careful with such an in- 
ference, because posterior probabilities are not frequencies 
(especially if we have very non-uniform priors, e.g. class frac- 
tions of (1,0.001, 1)). 

How can we make statements about the sample as a 
whole from the individual probabilities? The general case is 
complex, but for the case that all the probabilities are the 
same, the solution is simple. If P(quasar) = p for all objects 
in a sample of size n, then the probability that exactly r of 
them (r ^ n) are non-quasars is 

p(n ' r ' p) = H^7yi (1 " p)V "" r) (7) 

The left panel of Fig. 1221 shows how (the logarithm of) this 
quantity varies with r for n = 1000 and p = 0.95. (It has 
a maximum of P(n,r,p) = 0.058 at r = 50). By summing 
equation [7] over ranges of r we can make more useful state- 
ments. For example, summing from r — to r — 50, we can 
say that the probability that there are 50/1000=5% con- 
taminants or fewer is 0.54. Equivalently, the probability that 
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Figure 22. The left panel shows the variation with r of P(n = 
1000, r,p = 0.95) (equation , the probability that there are 
exactly r contaminants in the quasar sample of n objects, each 
with posterior quasar probability of p. The right hand panel is 
1 minus the cumulation of this (equation [8} , and so shows the 
probability that there will be more than r contaminants (shown 
as the contamination fraction r/n). 



ination in a sample where quasars are rare (dashed lines in 

Fig. ED . 

We demonstrated (in section I4.3.4[) that the applica- 
tion of the nominal model to a data set with rare quasars 
gives poorer results (larger contamination) than the modi- 
fied model at a given probability threshold. It is not yet clear 
whether we could just use the nominal model but with much 
higher probability thresholds to achieve similar results. How- 
ever, there are two reasons why we should not want to do 
this. First, as we must not build test data sets with very 
small numbers of the rare class, we would still have to use 
the modified class fractions in order to correctly predict the 
C&C (section 12. 5p . Second, we want models which deliver 
actual probabilities, not just numbers between 0.0 and 1.0 
to which we apply a meaningless threshold. This is especially 
important if we later want to update the probabilities, e.g. 
in a multi-stage classifier. We therefore believe it is better 
to use the modified model. 



there is more than 5% contamination is 0.46. The right hand 
panel of Fig. !22l shows this for variable r. Specifically it plots 

r — r 

l-^P(n,r' ; p) (8) 

r'—0 

against r/n, the contamination fraction. This figure agrees 
approximately with our intuition of a 5% contamination 
when all posterior probabilities are 95%. 

We can apply this to the case in section 14.3.31 where 
we applied the modified model to a test with 60 quasars. 
Setting P t = 0.2 we achieved a zero contamination sample 
of 37 quasars (62% completeness; Table|4]). If we didn't know 
the true classes, would we have expected zero contamination 
with the above calculation? The posterior probabilities are 
not equal (the expected distribution is P(quasar| QUASAR) 
in Fig. I10p . but we approximate by setting p equal to the 
average for this sample, which is 0.95. Equation [8] tells us 
(n = 37, r — 1, p — 0.95) that the probability of having 
one or more contaminants is 0.56. This is consistent with 
having no contaminants. Of course, this average probability 
is not very representative, so this doesn't strictly apply. For 
example, we could also have taken a threshold of P t = 0.50 
and still had a zero contamination sample, but in this case 
the average is p = 0.997 and the probability of one or more 
contaminants is just 0.006. 

Despite these approximations, it appears that the infer- 
ences about the sample as a whole are consistent with the 
individual posterior probabilities. 

5.3 Why we should prefer the modified model 

Ultimately we would like to know whether the modified 
model is "better" than the nominal one. If we ignored the 
issue of priors and class fractions, then our predictions of 
sample completeness and contamination would always be 
given by Fig. QT] (at G=18.5), regardless of the true class 
fractions in the target population. If we improve this by in- 
stead using equation [5] to accommodate the modified class 
fractions (but still with the nominal model posteriors) , then 
our predictions would be the solid lines in Fig. 1171 Yet both 
are poor predictions of the actual completeness and contam- 



6 CONCLUSIONS 

We have introduced a method of probability modification 
which can be used to build clean samples of rare objects for 
which the completeness and contamination can be reliably 
predicted. We have demonstrated this using a support vector 
machine classifier, although it may, of course, be used in 
conjunction with any classifier which gives probabilities. The 
main conclusions of our work are as follows. 

• To construct a pure sample of objects we should use 
a probabilistic classifier and only select objects with high 
probabilities. By varying this probability threshold we can 
trade off sample completeness and contamination. 

• To achieve pure samples of rare objects, we must take 
into account the expected class fractions in the target popu- 
lation, which act as a prior probability on the classifier. We 
use these to modify the nominal classifier outputs to give 
the modified model. 

• We applied our modified model to a three class problem 
in which quasars are simulated to be 1000 times rarer than 
stars and galaxies. We can achieve a pure quasar sample 
(zero contamination) yet still reach a sample completeness 
of 50-65% for magnitudes down to G=20.0. Although the 
test set is finite in size, this correponds to an upper limit in 
the contamination of 1 in 39 000. This is more than adequate 
for establishing an astrometric reference frame for Gaia: If 
Gaia observes half a million quasars, we can build a quasar 
sample of 250 000 with no more than 13 contaminants. 

• While achieving this pure quasar sample, we simul- 
tanesouly achieve very complete galaxy and star samples 
(both 99%) with low contamination (both 0.7%) (figures for 
G=18.5). 

• These results were achieved after removing quasars with 
low equivalent width emission lines from the training sample 
(defined somewhat arbitrarily as 5000 A). Including these 
precluded establishing a low contamination sample, because 
it resulted in cool (4000-8000 K) highly reddened (A v = 8- 
10) stars being confused with them. After removing these 
quasars, the first stars to contaminate a quasar sample (if 
we set a low threshold) are these reddened cool stars as well 
as hot stars (T cff > 40 000 K). 
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• Including parallax and proper motion (either as addi- 
tional SVM inputs, or in a separate mixture model classifier) 
hardly changes the performance. This is not surprising since 
the majority of objects with astrometry consistent with zero 
are actually stars. 

• Extending the training and testing sets to include 
quasars with a full range of interstellar extinction does not 
significantly alter the results (completeness slightly lower, 
but contamination unaffected) 

• All classification models have a prior, but the prior is 
often not explicit and are sometimes implicitly influenced by 
the training data distribution. We have introduced a simple 
method for calculating the implicit priors in a classification 
model, which we call the model-based priors. In many cases 
we have experimented with, these priors are close to the 
true class fractions in the training data (nominal model) or 
modified class fractions (modified model). 

• We recommend that a classifier be trained on roughly 
equal numbers of objects in each class so that it can properly 
learn the class distributions or boundaries. By determining 
the model-based priors and replacing them with something 
more appropriate to the target population (e.g. quasars be- 
ing rare), we can produce a modified model with superior 
performance. In particular, this is far better at producing 
large, pure samples of the rare class. 

• With our approach we can apply the model to any new 
target population by specifying the appropriate class frac- 
tions (priors) without having to change the training data 
distribution or re-train the model. 
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